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ABSTRACT 

Using over a million and a half extragalactic spectra from the Sloan Digital Sky 
Survey we study the correlations of the Diffuse Interstellar Bands (DIBs) in the Milky 
Way. We measure the correlation between DIB strength and dust extinction for 142 
DIBs using 24 stacked spectra in the reddening range E (B — V) < 0.2, many more 
lines than ever studied before. Most of the DIBs do not correlate with dust extinction. 
However, we find 10 weak and barely studied DIBs with correlations that are higher 
than 0.7 with dust extinction and confirm the high correlation of additional 5 strong 
DIBs. Furthermore, we find a pair of DIBs, 5925.9 A and 5927.5 A, which exhibits 
significant negative correlation with dust extinction, indicating that their carrier may 
be depleted on dust. We use Machine Learning algorithms to divide the DIBs to 
spectroscopic families based on 250 stacked spectra. By removing the dust dependency 
we study how DIBs follow their local environment. We thus obtain 6 groups of weak 
DIBs, 4 of which are tightly associated with C 2 or CN absorption lines. 

Key words: ISM: general - ISM: lines and bands - ISM: molecules - dust, extinction 
- surveys - techniques: spectroscopic - astrochemistry 


1 INTRODUCTION 

The diffuse interstellar bands (DIBs) are unidentified ab¬ 
sorption features found in optical and near infrared stel¬ 


lar spectra. Since their discovery in 1919 by Heger (1922) 


and the establishment of their interstellar medium (ISM) 


origin (Merrill 1936 Merrill & Wilson 19381 they remain 


an astronomical mystery (see reviews by Herbig 1995 and 
Sarrc 2006). Hundreds of DIBs have been found so far and 
many carriers have been proposed. Polycyclic aromatic hy¬ 
drocarbons (PAHs) have been suggested by Salama et al. 
( | 1999| ) , [Zhou et aI7| ( |2006[ ) and |Leidlmair et al. ( |2011| ), |Kroto 
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et al. (19851 suggested fullerenes. Motylewski et al. | (20001, 
Krelowski et al. {2010) and |Maier et a ]. | ( |201ip suggested 


different hydrocarbon molecules and ions (HCsN + , HC4H + , 
and linear C3H2 respectively). 

Since the hundreds of known DIBs are unlikely to come 
from a single carrier (see for example |Herbig fe Leka|1991[ ), 
searches for correlations between DIBs are carried out with 
the purpose of revealing sets of DIBs that may come from 
the same or similar carrier (see reviews by [Herbig 1995] and 
Sarre|2006 1. Weselak et al. (20041 found that the strength of 


the 5797.1 A DIB correlates tightly with CH column density. 
|Ehrenfreund et al.| ( |2002| and |Cox fe Patat| ( |2008[ ) proved 
the existence of DIBs and molecular features in the spectra 
of stars in the Magellanic Clouds and M100, respectively. 
|Kazmierczak et al.| (|2009[) have found a correlation between 
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2 Baron et al. 


the full width half maximum (FWHM) of the 6196 A DIB 
and the rota tional temperature of the C2 molecule. |Weselak| 


et al. 


(2014) found a correlation between the 4964"A DIB 


Poznanski, Prochaska & Bloom (2012) and Baron et al. 


(20151 (hereafter paperl) use millions of SDSS spectra of 


and CH and between the 6916 A DIB and CH+. 

The concept of DIBs families was first described by 
Krelowski & Walker] (19871 and was since developed by oth¬ 


ers. Some of the studies compute the correlation of equiv¬ 
alent width (EW) or central depth (CD) for pairs of DIBs 


(Chlewicki et al. 1986 

Josafatsson & Snow|1987 

Cami et al. 

1997 

|Moutou et al. 

1999 

Weselak et al. 2001 

Cox et al. 

2005 


Irync 

al & Wszolek 20071 McCall, Dros- 

back & Thorburn |2010; Xiang, Liu & Yang 20121, other 


studies use the correlation with color excess or UV radia- 


tion to achieve the task 

([Chlewicki et al. 1986| Krelowski & 

Walker 1987; |Josafatsson & Snow] 1987; Krelowski fe West- 

erlund 1988 Vos et al. 

2011 

Kos & Zwitter 2013 

1. More- 


over, DIBs with complex structure were studied and their 
structures were compared ( Jenniskens fe Desert|199 3). DIBs 
strength is the result of a combination of the properties of 
every physical environment. As the line of sight toward an 
object usually samples several intervening clouds, studies 
typically concentrate on observing objects with only one in¬ 
tervening cloud with a known extinction and UV radiation 


field (Cami et al. 1997). 


It has been proposed that DIBs spectroscopic families 
are composed of few strong and a number of weak DIBs 


(Moutou et al. 1999 Fulara & Krelowski 2000). Recent 


studies have divided DIBs into three spectroscopic fami¬ 
lies: (1) DIBs 5797.1 A, 5849.8A, 6196.2A, 6379A and 
6613.7 A which have narrow profiles with a substructure, 
(2) DIBs 5780.6A, 6284.3A and 6204.3A which show no 
apparent substructure and (3) DIBs 4501.8 A, 5789.1 A, 
6353.5 A and 6792.5 A which are weak DIBs for which a 


Krelowski & Walker|1987 

Cami et al. 1997| Porceddu, Ben- 

venuti & Krelowski 1991 

Moutou et al. 1999; Wszolek & 


Godlowski 2003). DIBs in these families correlate reasonably 


galaxies and quasars to study the properties of the MW 
NalD absorption doublet and DIBs. They show that even 
though every spectrum has a low SNR, by binning the spec¬ 
tra in large numbers (typically more than 5000) one can 
detect the strongest DIBs and measure their EW with a 
17%-20% accuracy. Here we use the central depth (the depth 
at central wavelength, CD) of DIBs instead of the EW and 
study the correlation of DIBs and reddening. We show that 
this method is better suited for our low SNR data and re¬ 
sults in correlation uncertainties that are 2-3 times smaller. 

As observing capabilities increase, hundreds of DIBs 
need to be studied in hundreds or thousands of spectra. This 
makes grouping and classification via traditional methods an 
intractable problem. Machine Learning was specifically de¬ 
signed to reveal patterns in large datasets and reduce the di¬ 
mensionality of such problems. We construct an algorithm 
that divides the DIBs to spectroscopic families based on 
their pairwise correlations. The core of the algorithm is the 
Hierarchical Clustering algorithm, which divides objects to 
groups based on the distance between them, as defined in 
some space. 

We present the data and our method in section [2] we 
then study the correlation between DIBs and dust in sec¬ 
tion [3] We develop an algorithm which computes the pair¬ 
wise correlation matrix of the DIBs and divides them into 
spectroscopic families in section [4] where we also test the 
algorithm on known atomic and molecular absorption lines. 
We review and discuss our findings in section [5] We dis¬ 
cuss the extensive simulations we perform to quantify our 
detection limits and uncertainties in appendices |A| and [B] 


2 STACKING SPECTRA 

Our reference list of DIBs includes 197 lines which we corn- 


well among the member s in the group bu t not with those pile using the catalogs of |Jenniskens fe Desert (fT994]), [Jen 


from other families (see Cox et al. 2005 for a summary). 
The pair 6196.2 A and 6613.7 A shows a particularly good 
correlation of 0.98 ± 0.18 and was studied by|Moutou et ah| 


niskens et al. 

(1996), Krelowski, Sneden & Hittgen 

19951, 

Hobbs et al. 

2008) and Jenniskens et al. (1996). 


(19991 and McCall, Drosback & Thorburn (20101, although 
|Galazutdinov et al.| ( |2002[ ), using high resolution and high 
signal-to-noise ratio (SNR) spectra, argue that the ratio of 
EW of the two is not always constant. 

Most of the DIBs studies are based on observations of 
a small number of stellar spectra, usually at low Galactic 
latitude and high extinction. Several studies have also used 
extragalactic sources, such as nearby galaxies or supernovae, 


to study the DIBs in the Milky-way (MW) (Welty et al. 

2006; 

Cox et al. 2006; Cordiner et al. 2008, 2011. van Loon 

et al. 

to 

o 

I— 1 

CO 

i. In recent years there has been an increase in 


The SDSS 9th data release (DR9) comprises of extra- 
galactic spectra that cover an area of more than one forth 
of the sky, most of which are at high galactic latitudes 
(30° < b < 90°). Since SDSS spectra are too noisy and 
have low spectral resolution to show the DIBs we stack them 
into bins of thousands of spectra. The detailed stacking pro¬ 
cess is described in paperl. Briefly, we use the galaxies and 
quasars with redshift z > 0.005 from the 9th data release 
of the SDSS which results in more than 1.5M spectra. The 
spectra are interpolated to a uniform grid between 3817 A 
and 9206 A with resolution of 0.1 A. We then normalize each 


the use of large samples (hundreds and more) of objects in 
order to draw more general conclusions regarding DIBs (see 


for example Friedman et al. |20. 

LI Xiang, Liu fe Yang 2012, 

Kos et al. 2013, Zasowski et al. 

2014, Maiz Apellaniz et al. 

2014 Lan, Menard & Zhu|2014 

and 

Weselak et al. 2014 

)• 


spectrum using the Savitzky Golay algorithm ( Savitzky fe 
|Golay|1964[ ) and group spectra by different parameters and 
combine them by calculating the median at every wave¬ 
length. 

For the correlation of flux with E (B — V) in section|3]we 
group the 1.5M spectra by their E (B — V ) values using the 
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Using ML to classify DIBs 3 


dust maps of Schlegel, Finkbeiner & Davis (1998) as done in 
paperl. Each of the 24 stacked spectra we obtain is based on 
stacking about 70 000 spectra, and has a SNR greater than 
240. For the pairwise correlation between absorption lines in 
section[4]we group the spectra by their coordinates and their 
E(73 — V) (see paperl for details) and result in 250 stacks 
spectra, each spectrum is a median of 5 000 spectra with 
a SNR near 180. We test different bin sizes for the latter, 
ranging from 1000 to 40 000, and find that the results are 
consistent. Furthermore, we find that stacking the spectra 
by coordinates only, regardless of the reddening, yields the 
same results. 


3 FLUX CORRELATION WITH E(B-V) 
3.1 Method 


DIB strength is usually represented by the EW of the ab¬ 
sorption line (see for example Friedman et al.|2011 and Kos 
fe Zwitter||2013| ) or by its CD ( |Moutou et al.||1999] |Wese- 
lak et al.||2001 1. Both of the methods have advantages and 
disadvantages that vary according to the particular study. 
As shown in paperl, in SDSS spectra, EW can be measured 
only for EW > 5mA[^] and its use imposes significant un¬ 
certainties due to the difficulty to determine the continuum 
level. Furthermore, the EW is usually measured by integrat¬ 
ing over the flux or a fitted profile (Gaussian, Lorentzian or 
Voight function) between selected endpoints: different meth¬ 
ods of EW measurement may yield different results, thus 
preventing a straightforward comparison between studies. 
While fitting a profile to uncorrelated blended lines intro¬ 
duces systematic correlation between them and therefore bi¬ 
ases their pairwise correlation, simple integration over the 
flux does not allow the separation of the lines and a blended 
pair must be treated as a single line. Here we use the flux at 
the central wavelength of the DIB to represent its strength 
and measure the correlation between CD and reddening. We 
show through extensive simulations that this allows us to 
study weak DIBs (EW ^ 6 mA) with the same precision as 
strong DIBs (typical uncertainty of up to 10%) and in some 
cases enables us to separate blended lines. 

We use the Pearson correlation coefficient to study the 
correlation of CD and E (B — V ), 


cov(x, y) 


(1) 


We assume that all the DIBs are in the optically thin 
absorption regime since we use objects at high Galactic 
latitudes and low dust column densities. This assumption is 
justified by the linear relations between EW and reddening 
for the strongest DIBs that were found in paperl, and by 


1 We generally consider a DIB as broad when FWHM > 2A and 
narrow when FWHM < 1.5A. We refer to strong bands when 


EW 


100 


mA 

E (B-V) • 


the lack of difference between Pearson correlation coeffi¬ 
cients and Spearman rank correlation coefficients (which 
tests for non linear relations) that we measure throughout 
our entire analysis. The flux of an absorption line is given by: 


/(A) = I c (X)e- rW (2) 

Where 7 C (A) is the flux of the continuum and r is the 
optical depth. In the optically thin regime the optical depth 
obeys r< 1 and the flux can be expanded: 


7(A)“7 a (c)(1-t(A)) (3) 

The EW in the optically thin regime is proportional to the 
optical depth: 


EW ~Il l -m dX ~ TW (4) 

Assuming that the flux of the continuum is constant we ob¬ 
tain: 

p(EW,E(B-V)) = -p(I(\ 0 ,E(B-V)) (5) 

Where Ao is the central wavelength. Lines that get stronger 
with extinction will have a lower flux and hence a negative 
correlation between flux and E(73 — V). For clarity we call 
this simply a ‘strong correlation’ omitting the fact that it is 
technically negative. Although we do not resolve the DIBs, 
the CD is linearly related to the EW if the point spread 
function of the system is approximately Gaussian, which it 
is. 

The normalization of the spectra described in section [2] 
is expected to remove any correlation between the contin¬ 
uum and the reddening and also result in a constant contin¬ 
uum level. On the other hand, the entire stacking process 
may affect the initial correlation of DIBs and reddening. 
We therefore simulate synthetic absorption lines with var¬ 
ious correlations with reddening to quantify the effects of 
the entire process on the correlation. The simulations are 
presented in appendix [A] Using the hundreds of thousands 
of lines that we simulate, we derive a function that converts 
between the measured correlation with reddening and the 
‘true’ correlation, prior to the stacking process. We find that 
the ability to detect the correlation of flux with reddening 
for an isolated absorption line depends only on the slope of 
the flux-E(7? — V ) relation and the width of the line, and 
that the initial correlation can be recovered for DIBs with 
a slope greater than HU 2 ' 9 ^^ and FWHM greater than 
0.5 A. We find that the initial correlation of a blended pair 
of lines depends on both of the measured correlations and 
can be recovered above the same slope threshold. Since we 
lack the resolution to measure the width of the DIBs we use 
the average width value from DIB catalogs to exclude DIBs 
which have widths that are below our threshold. However, 
we measure the correlation of DIBs with borderline widths 
values (e.g the measured FWHM of the 6196 A DIB is in the 
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range 0.2 — 0.8 A according to different studies) and these 
should be treated with caution. In this case, low measured 
correlation can indicate a non detection. We mark the bor¬ 
derline DIBs in the online material. 

We use the 24 spectra, stacked using the |Schlegel7| 
Finkbeiner & Davis (1998) maps, as described in section 
2j _ and measure the correlation coefficient between flux and 
reddening for every wavelength in our wavelength grid. The 
result is a continuous correlation spectrum. 


3.2 Results 

For presentation, we color a synthetic spectrum based on the 
DIB catalog according to the correlation we measure. We 
clip the colors in figures [l] [4] to strong correlation values, for 
clarity. Red regions in figures [l] and [2] mark high correlation 
of flux with color excess (negative; since in absorption), while 
yellow coloring implies little to no correlation. Blue regions 
in figures |3] and [4] mark high correlation (positive), while 
green coloring implies little to no correlation. Clearly some 
lines show strong correlation (e.g. 5780.6A and 5797.1 A) 
while others, often as strong, do not (e.g. 4428 A). Further¬ 
more, one can see the correlation with dust of interstellar 
absorption lines which are not DIBs. 

Our simulations show that we can robustly recover the 
true correlation only above a threshold in the slope of the 
flux-reddening relation. 154 DIBs are above the threshold, 
thus their correlation with reddening can in principle be 
measured. Of the 154 DIBs, 132 are isolated and 22 are 
blended in 11 pairs. 4 pairs are discarded since they exhibit 
correlations with opposite signs and 2 pairs are discarded 
since they exhibit correlations that are lower than 0.5 (see 
appendix [A| for details), thus 142 DIBs remain. A sample 
of the correlations we measure is given in table [l] the full 
table can be found in the online version of this paper. The 
correlation uncertainties we deduce for the isolated DIBs 
vary from an average of 5% for high correlations (0.8) up to 
15% for low correlations above 0.3. We find in paperl that 
the uncertainty of the EW for the strongest DIB, 5780.6 A 
is 17% (and larger for smaller DIBs) and the uncertainty in 
the correlation is of order 20%. These results are therefore 
2-3 times better than in paperl. 

Of the 142 DIBs, only 15 isolated DIBs and 2 pairs of 
blended DIBs have a statistically significant correlation with 
E (B — V) that is higher than 0.7. While the high correlation 
of the DIBs: 5705 A (pair), 5780 A (pair), 5797.1 A, 6284.3 A, 
6613.7 A and 8621.1 A is a confirmation of what was already 
established in previous studies (see for example |Friedman 
et al. 2011 Kos & Zwitter 2013 and Kos et al. 12013 1, we 


find high correlations with dust extinction for much weaker, 
rarely studied DIBs: 5975.6 A, 6308.9 A, 6317.6 A, 6325.1 A, 
6353.5 A, 6362.4 A, 6491.9 A, 6494.2A, 6993.2 A, 7223. lA 
and 7558.2 A. 

Moreover, we find a blended pair of DIBs, 5925.9 A and 
5927.5 A, which exhibit a high positive correlation between 
their flux and E (B — V), i.e., negative correlation between 
their strength and dust extinction. These DIBs are the only 


ones out of the 142 bands that exhibit such strong nega¬ 
tive correlation with dust extinction, they show correlations 
that are consistent with being similar and they are highly 
blended, thus they are likely to originate from the same car¬ 
rier, perhaps one that is easily depleted on dust. 

We find that the DIBs 4428 A, 5512 A, 5850 A, and 
6445 A are absent in our data and the DIBs 4762 A, 4964 A, 
5487 A, 6090 A, and 6379 A have a low correlation with dust 
(lower than 0.5). The above are among the strongest and/or 
most studied DIBs, most of which exhibit medium to high 
correlations with dust in the Galactic plane (see for exam¬ 
ple: Friedman et al.|2011 and Kos fc Zwitter|2013 1. Since we 
measure strong and significant correlations with dust extinc¬ 
tion for other DIBs with similar strengths and correlations 
we conclude that the above indeed behave differently in our 
data. 


Established by Krelowski & Walker (1987) and later 


studied and adjusted by a number of studies ( 

Chlewicki 

et al. 1986; Krelowski & Westerlund 1988; Krelowski, Sne- 

den & Hittgen 1995 

Cami et al. 1997; Weselak et al. 2001 

Kos & Zwitter 

2013 

), the difference in DIBs behavior in 


different environments has led to the suggestion that UV 
radiation leads to the weakening of some DIBs while not 
affecting other DIBs or even strengthening them. The ma¬ 
jor classes that are said to exist are a (not UV shielded) 
and £ (UV shielded) clouds. This partition was made us¬ 
ing the strengths of the 5780.6 A and 5797.1 A DIBs. While 
the 5780.6 A DIB appear stronger where the UV radiation 
is stronger (<7 type), the 5797.1 A appears stronger in UV 


shielded environments (/ type). Following this, Kos & Zwit¬ 
ter! (2013) divide their sight lines into a and £ type clouds 


based on the ratio of the 5797 A and 5780 A DIBs strength 
and find that this partition shows at least two different types 
of DIBs: type I DIBs where the behavior of the DIB EW and 
reddening is similar for a and £ sight lines (5705 A, 5780 A, 
6196 A, 6202 A and 6270 A) and type II DIBs where it dif¬ 
fers significantly (4964 A, 5797 A, 5850 A, 6090 A, 6379 A, 
6613 A and 6660 A). The DIBs we find to be absent or to 
have low correlation with E(B — V) are inconsistent with 
these groups and with similar groups that were established 
by other studies using the ajC, partition i.e., some of the 
type I DIBs from Kos & Z witter| (20131 are absent in our 
data, some show low correlation with dust and some show 
the opposite. The same applies to type II DIBs. 


McIntosh & Webster (19931 studied the strength of the 
DIBs 4428 A, 5780.6 A and 5797.1 A as a function of Galac¬ 
tic latitude. They found that the strength of 4428 A rela¬ 
tive to the others is greatest at low latitude and decreases 
with increasing latitude whereas the strength of 5797.1 A is 


greatest at high latitude, van Loon et al. (2013) find that 
the growth of the EW of the 4428.1 A appears to slow down 
as the EW of the 5780 A DIB grows. We show in paperl 
that the 4428 A is absent in our spectra while the 5780.6 A 
and 5797.1 A DIBs tend to have higher EW per reddening 
unit compared to the more UV shielded Galactic plane, in 


agreement with the findings of McIntosh & Webster (19931 
and |van Loon et al.| (|2013|) . We therefore suggest that the 
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Table 1 . DIB correlation with reddening 


Wavelength*[A] 

Deduced correlation 

5704.7* 

0.99 ±0.50 

5705.1* 

0.99 ± 0.54 

5779.5* 

1.0 ±0.055 

5780.6* 

1.0 ±0.055 

5797.1 

0.87 ± 0.02 

5925.9* 

-0.9991 ± 0.0004 

5927.5* 

-0.9997 ± 0.0002 

5975.6 

0.72 ±0.03 

6203.2* 

0.99 ± 0.30 

6204.3* 

0.99 ±0.31 

6280.5* 

0.99 ±0.11 

6281.1* 

0.99 ±0.12 

6284.3 

0.976 ±0.029 

6308.9 

0.762 ±0.025 

6317.6 

0.841 ± 0.022 

6325.1 

0.713 ±0.025 

6353.5 

0.839 ± 0.023 

6362.4 

0.702 ±0.025 

6491.9 

0.808 ±0.025 

6494.2 

0.904 ± 0.024 

6613.7 

0.801 ± 0.025 

6993.2 

0.704 ±0.025 

7223.1 

0.706 ±0.025 

7558.2 

0.777 ± 0.026 

8621.1 

0.841 ± 0.022 


Correlation between CD and reddening for the DIBs that 
exhibit a strong correlation. The full table is available in the 
online version of this paper. 


*We mark DIBs that are part of a blended pair. 


absence and weakening of some of the strongest, most stud¬ 
ied, DIBs is not due to UV radiation but highly connected 
to the unique environment we probe in this study. Our re¬ 
sults in paperl reinforce the findings of |Mc!ntosh fe Webster| 
(19931 and our current results add more DIBs that follow 
the same patterns. The group 4428 A, 5512 A, 5850 A, and 
6445 A tends to disappear in high Galactic latitudes while 
the group 4762 A, 4964 A, 5487 A, 6090 A and 6379 A exhibit 
almost no correlation with dust extinction (0 — 0.4). 


4 CLUSTERING LINES 


As we show above, while some DIBs correlate fairly well 
with color excess, many more do not, and there is substan¬ 
tial intrinsic scatter (see also van Loon et al.|2009||Friedman| 


|et al.||2011| |Kos fe Zwitter||2013[ |van Loon et al.||2013| and 

paperl). Furthermore, recent studies have measured DIBs 
strength in low column densities environments and found 
that the scale height of the DIBs carriers may be more ex¬ 
tended than the distrib ution of dust (p aperl for 5780.6 A 
and 5797.1 A DIBs and Kos et al.|2014| for 8621.1 A). 


Following these findings we wish to compute the pair¬ 
wise correlation of DIBs after removing the correlation with 
dust, basically using dust as a tracer for gas column density, 
which is a nuisance parameter. The result of the process is a 
197X197 matrix that represents the pairwise correlation of 
all the DIBs. We then wish to divide the DIBs to spectro¬ 
scopic families based on this correlation matrix. The algo¬ 
rithm, which takes spectra as an input and returns groups of 
lines (DIBs or others) as an output, is described in depth in 
section [XT] We test the algorithm, find its detection limits 
and characterize its precision and accuracy in appendix |B| 
We further test the algorithm on known atomic and molec¬ 
ular absorption lines in section |X2| and discuss the results 
for DIBs in section D~3l 


4.1 Algorithm 

For clarity, we describe the algorithm as it works with an 
input of 250 binned spectra, each spectrum has a median 
extinction value and 8000 wavelength elements from 4000 A 
to 8000 A with a resolution of 0.1 A. In order to compute 
the correlation between pairs of fluxes one must first remove 
the correlation between flux and reddening, which we do by 
computing the partial correlation: 


Pflf2 ~ PflA„PA„f 2 
Pflfl,A v = , - , = 

V 1 ~ p\a v \J 1 - p\„Si 


( 6 ) 


Where the extinction A v is the nuisance parameter, /i and 
f- 2 . is the pair of fluxes we wish to compute partial correlation 
for, and p a b is the Pearson Correlation Coefficient between 
the a and b variables. 

Computing the partial correlation between every indi¬ 
vidual pair of fluxes in the wavelength grid returns a 8000 X 
8000 correlation matrix. The [i,j] element of the correlation 
matrix represents the partial correlation between the fluxes 
in the ith and j th wavelength indices. Therefore, the correla¬ 
tion matrix is symmetric, positive-definite, and its primary 
diagonal is equal to one. 

Diagonals which are far away from the primary diag¬ 
onal follow a Normal distribution with zero mean. This is 
expected since randomly chosen distant wavelengths should 
not be correlated. Diagonals which are close to the primary 
diagonal measure the flux-flux correlation of nearby wave¬ 
lengths and their mean is shifted towards 1 as the diagonal 
gets closer to the primary diagonal. This bias is the result 
of our limited resolution and some systematic effects that 
are due to the calibration and analysis process. We wish 
to remove it. Figure [5] presents a sample of correlation dis¬ 
tributions for different diagonals. In order to remove the 
spurious correlation that is caused by wavelength proximity 
we normalize each element in the correlation matrix in the 
following way: 


Pnorm — 


P~ Pk 
&k 


(7) 


Where fik is the median correlation for the fcth diagonal 
and dfe is the median absolute deviation (MAD) of the fcth 
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wavelength (A) 


Figure 1 . Correlation spectrum for the wavelength range 3800 — 6500 A. The spectrum is a simulated spectrum based on the DIBs 
catalog and is colored by the correlation coefficient between flux and E(B-V), red for strong (negative; since in absorption) correlation 
(-1) and yellow for a correlation that is close to zero. The colors are clipped to correlation values of (0.5,1) for clarity. 


© 0000 RAS, MNRAS 000, 000-000 



























































































































flux flux flux 


Using ML to classify DIBs 7 



wavelength (A) 


Figure 2. Same as Figure|T] for the wavelength range 6500 — 9200 A. 
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Figure 3. Correlation spectrum for the wavelength range 3800 — 6500 A. The spectrum is a simulated spectrum based on the DIBs 
catalog and is colored by the correlation coefficient between flux and E(B-V), blue for strong correlation (1) and green for a correlation 
that is close to zero. The colors are clipped to correlation values of (-0.5,-!) for clarity. 
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Figure 4. Same as Figure]^ for the wavelength range 6500 — 9200 A. 
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Partial correlation coefficient 


Figure 5. Correlation distribution as a function of distance in 
wavelength. The flux-flux correlation of wavelengths that are sep¬ 
arated by lA (black) show a median correlation of 0.65, wave¬ 
length that are separated by 5 A (blue) show a median correlation 
of 0.1 and wavelengths that are separated by 500 A (red) have a 
median correlation of 0. 


diagonal. We thus obtain a 8000 X 8000 normalized corre¬ 
lation matrix in which the [i, j] element represent the devi¬ 
ation in a units from the median correlation. We consider 
\pnorm\ > 3<t as high correlation (see discussion below). We 
present a cutout of the partial correlation matrix and nor¬ 
malized correlation matrix around the NalD wavelengths 
in figure [6] Despite the proximity of the doublet lines one 
can see that the strong positive correlation between them 
remains in the normalized matrix. It is also apparent that 
the correlated lines produce a secondary effect of strong neg¬ 
ative correlation that fades away as the element is farther 
away from the NalD wavelength. This does not affect our 
final results. 

We then reduce the normalized correlation matrix to a 
matrix that contains only the elements of the pairwise cor¬ 
relation between absorption lines we wish to study, using 
their central wavelength. We tested several methods to ob¬ 
tain the reduced matrix. The simplest is taking the elements 
[i.j] that match the ith and jth central wavelengths. We also 
tried integrating over the wavelength range of a given line 
using its measured profile as a weight function. We find that 
the latter method smears the resulting correlation, some of 
the smearing is caused by the secondary effect of the neg¬ 
ative correlation. For the NalD doublet, for example, we 
measure a correlation significance of 4.3<r while in the bot¬ 
tom of figure [6] the significance at the central wavelengths is 
7.8cr. We therefore use the exact wavelengths to extract the 
correlation element and obtain the reduced matrix. Figure 
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Figure 6. Partial correlation matrix cutout around the NalD 
wavelengths (top) and the normalized correlation matrix (bot¬ 
tom). 


[TJpresents the 197 X 197 normalized correlation matrix that 
contains the pairwise correlation of the DIBs. The correla¬ 
tion matrix contains 19 306 unique elements, which prevents 
us from dividing the DIBs into groups manually. We use a 
Hierarchical Clustering algorithm to achieve this goal. 

Hierarchical Clustering is a Machine Learning algorithm 
that seeks to build a hierarchy of clusters (i.e., groups) based 
on the distance between objects. We use the agglomerative 
type of clustering, the bottom-up approach, in which each 
object (absorption line in our case) starts as a one sized 
cluster, and pairs of clusters are merged according to their 
distance as one moves up the hierarchy. In order to convert 
correlations between lines into distances, we perform the fol¬ 
lowing linear transformation on the normalized correlation 
matrix, such that a strong positive correlation transforms 
to a short distance, and a strong negative correlation trans¬ 
forms to a long distance: 
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Figure 7. 197 by 197 correlation matrix for the DIBs. Every cell 
represents the partial (normalized) correlation coefficient between 
a pair of DIBs after the correlation with E(B-V) is removed. Blue 
(red) cells represent strong positive (negative) correlation between 
lines. We clip the correlations in the figure to [—3cr, 3a] for clar¬ 
ity, from the original range of [—8a, 8a]. The wavelengths that 
correspond to the DIBs indices are available online. 


Dij = max(Rij) - Rij 


( 8 ) 


Where Rij is the normalized and reduced correlation 
matrix that is shown in figure [7] The transformation yields 
the matrix Dij which represents the pairwise distance be¬ 
tween DIBs. 

Every DIB starts as a one-sized cluster. We choose the 
pair with the closest distance and merge it into a two¬ 
sized cluster. We then define the distance between the newly 
formed cluster to the rest of the clusters (so far they all con¬ 
tain only one DIB) as the maximal distance between the 
members in the clusters, i.e., the lowest correlation between 
the members of the clusters. We repeat these steps until all 
the DIBs are clustered. 

This method is called complete linkage clustering, in 
which new distances are defined as the maximal distance be¬ 
tween all the members of the two new-formed clusters. The 
clusters that are formed using this method are the worst case 
result since we group the DIBs by the lowest correlation of 
the bunch. Our tests indicate that other choices for the dis¬ 
tance assigning: single linkage (minimal distance, maximal 
correlation), average linkage, and weighted linkage produce 
almost the same clustering and very similar groups. 

The result of the clustering for the correlation matrix in 
figure [7] is presented in figure [8] One can see that the result 
of the clustering contains colored branches that are sub¬ 


clusters in which all the members have a pairwise correlation 
of more than 3 a. We define the sub-clusters that fall beneath 
this threshold as spectroscopic families. A larger version of 
this figure with DIB wavelength labels can be found in the 
online material. 

The algorithm discussed above consists of several 
phases, each of the phases can be implemented in various 
ways. We perform extensive simulations throughout the de¬ 
velopment of the algorithm in order to test every step and 
find the best implementation to it. The simulations are dis¬ 
cussed thoroughly in appendix[B] by injecting synthetic ab¬ 
sorption lines with correlations among them to the raw spec¬ 
tra we form synthetic spectroscopic groups. We compare 
these groups to the groups we obtain from the algorithm 
and statistically infer the probability of the resulting group 
to be a true-positive (TP) group or a false-positive (FP) 
group. We find that the probability of a resulting spectro¬ 
scopic group to be a TP or FP depends on the correlation 
of the lines in it with reddening and on their slopes of the 
flux-E(£> — V) relation. Throughout the next section we as¬ 
sign probabilities for a group to be TP or FP based on these 
simulations. 


4.2 Atomic and molecular lines 

Known molecular and atomic absorption lines can be used 
as an additional test to our clustering algorithm. Absorp¬ 
tion lines of the same molecule or atom should be highly 
correlated and be clustered as one spectroscopic family 
(though different levels may be populated in different en¬ 
vironments, complicating the picture). We use a list of 114 
atomic and molecular absorption lines (Cox private com¬ 
munication) that contains the central wavelengths of CH, 
CH+, NH, OH, OH+, CN and C 2 molecules and All, Cal, 
Call, CrI, Fel, He I, KI, Li I, Nal, Rbl, and Till atoms. 
Following the procedure described above, we obtain a 114 X 
114 normalized correlation matrix for these lines. We note 
that this includes lines that are too weak to be detected in 
our spectra, which is a good test of the applicability of our 
method to DIBs that are also not all detectable. 

We present the clustering result for atoms and molecules 
in figure [9] 11 clusters of lines form, each containing 2 ab¬ 
sorption lines with a pairwise correlation that is higher than 
3 a. Each significant cluster is assigned a color in the figure. 
Different atoms and molecules have different label colors. We 
expect and see that clusters are almost always populated by 
a single species. 10 out of the 11 formed groups contain a 
pair of absorption lines that belong to the same molecule 
or atom while only one group contains a pair of lines that 
belong to different carriers. 

We list in table [2] the 11 groups, the absorption lines 
they contain and their probabilities to be a perfectly recov¬ 
ered group ,P(0), and a TP group. The groups probability to 
be a FP group is l-P(TP). We are able to recover 7 groups of 
C '2 absorption lines, 2 groups of 12CN absorption lines and 
the group of the NalD atom. We find that 10 of the groups 
have a probability to be a TP group that is higher than 0.5 
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Figure 9. Clustering result for atoms and molecules. 
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Table 2. Atoms and molecules groups for 3 a threshold 


Group ID WLs[A] Carrier P(0) a P(TP) a 



ByO) 


Figure 8. DIBs clustering by the partial correlation between 
their flux. The colored clusters represent the groups for which 
the pairwise correlation is higher than 3 a. A larger version of 
this figure with DIB wavelength labels can be found in the online 
material. 


1 

7720.7, 8782.3 

C2 

0.23 

0.64 

2 

8853.0, 9142.8 

C2, CN 

0.02 

0.4 

3 

7722.1, 7731.0 

C2 

0.05 

0.42 

4 

8798.5, 8804.5 

C2 

0.12 

0.55 

5 

6929.1, 9183.2 

C2 

0.13 

0.52 

6 

8762.1, 8768.6 

C2 

0.17 

0.68 

7 

8761.2, 8767.8 

C2 

0.4 

0.77 

8 

8757.7, 8766.0 

C2 

0.36 

0.75 

9 

5890.0, 5895.9 

NalD 

0.08 

0.73 

10 

7753.1, 7762.6 

C2 

0.1 

0.62 

11 

9186.9, 9190.1 

CN 

0.11 

0.56 


Probability for perfect recovery. 

'Probability for TRUE POSITIVE = P(0) + P(l) 

while 2 of the groups present probabilities of 0.4 and 0.42 
to be a TP group. Group 2, with the lowest probability in 
the bunch, is also the group which contains absorption lines 
from different carriers. 

A few conclusions are apparent. First, the clustering 
works well, in the sense that 10/11 groups are correct. Sec¬ 
ond, very few lines cluster, most remain single. We discuss 
this point further in section pj~l] Last, we see that our clus¬ 
tering algorithm is consistent and that the simulations we 
perform are representative of the real lines, since the prob¬ 
abilities point well towards our successes and failures. The 
fact that P(0) is typically very small reflects that the algo¬ 
rithm only recovers partial groups, as is apparent for exam¬ 
ple for C2 lines that are scattered among multiple groups. 
Examining the 22 clustered lines in our spectra, we see that 
they can be seen by eye, while most unclustered lines are 
not visible. 

4.3 DIBs results 

The DIBs 197 X 197 clustering normalized partial correla¬ 
tion matrix is shown in figure [T] and the clustering result in 
figure [8] We obtain 12 DIB groups, 10 of which contain two 
absorption lines while two of them contain three absorption 
lines. We compute the probability of every group to be a TP 
group, and list them in table [3] 

As found for atomic and molecular lines in the previous 
section, the probability of a group to be perfectly recovered 
is essentially null. We And that 6 out of the 12 groups of 
DIBs have a probability that is higher than 0.5 to be a TP 
group, similar to the numbers obtained for the atomic and 
molecular groups. Following the simulations and the atomic 
and molecular clustering we suggest that groups 2, 3, 4, 5, 

7 and 8 in table [3] are spectroscopic families. We use equa¬ 
tion 0 to extract the real correlations out of the normalized 
correlations, and And that the DIBs in groups 2,3,4,5,7 and 

8 have pairwise partial correlations of 0.45 — 0.6 while the 
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Table 3. DIBs groups for 3cr threshold 


Group ID 

DIBs[A] 

P(0) a 

P(TP) a 

3 

6317.6, 6325.1 

0.28 

0.74 

2 

6497.5, 6862.5, 7569.7 

0.27 

0.7 

4 

7276.7, 8026.2 

0.27 

0.7 

5 

5927.5, 7045.6 

0.28 

0.66 

7 

5535.7, 5541.0, 5545.0 

0.28 

0.66 

8 

6330.4, 7721.0 

0.23 

0.64 

1 

7357.2, 7585.6 

0.05 

0.33 

11 

7321.1, 7832.7 

0.00 

0.29 

6 

7334.3, 7935.3 

0.02 

0.27 

9 

7099.5, 7105.9 

0.00 

0.14 

12 

6998.7, 7249.3 

0.00 

0.14 

10 

6837.7, 7562.2 

0.00 

0.12 


“Probability for perfect recovery. 

6 Probability for TRUE POSITIVE = P(0) + P(l) 

DIBs in the remaining groups (the groups with probabili¬ 
ties to be TP groups that is smaller than 0.5) have pairwise 
partial correlations of 0.3 — 0.4. 

One can see that none of the strongest, most exten¬ 
sively studied DIBs, is pr esent in the final groups we present. 
We show in section 3.2 that the DIBs 5780.6 A, 5797.1 A, 
6613.7 A and 8621.1 A have high correlations with E (B—V), 
it is therefore probable that most of their signal is removed 
in the process of computing the partial correlation. For the 
6196.2 A DIB a correlation with E (B — V ) cannot be mea¬ 
sured since its slope is not above our threshold. We also 
compute the correlation between the 6613.7 A and 6196.2 A 
DIB pair, which has been shown to correlate strongly. We 
find a negligible correlation of 0.0031 between the two DIBs 
which is expected since the DIB 6196.2 A is undetected in 
our spectra, as discussed in paperl. 

Some further clues may be gained by clustering DIBs 
together with the atomic and molecular lines. We cluster the 
197 DIBs in our catalog and the 114 molecular and atomic 
absorption lines and obtain a 311 X 311 normalized partial 
correlation matrix. Our algorithm finds 27 groups of absorp¬ 
tion lines, two of them are clearly false positives since they 
contain absorption lines that arise from different molecules, 
and their probabilities are 0.21 and 0.17, reflecting their du¬ 
biousness. 16 clusters have a TP probability that is higher 
than 0.5, and essentially recover the groups found by clus¬ 
tering them separately. The only new result are four clusters 
that include a mix of molecules (C 2 or CN) and DIBs, and 
we list them in table [5] 

4.4 Optimization: more clusters, less confidence 

Our ultimate goal was (and still is) to cluster all the DIBs 
into families. However, we are limited by our SNR and sys¬ 
tematic effects that are beyond our control. Our simulations 
indicate, based on the slope and the correlation with dust 


Table 4. DIBs groups for 2 a threshold 


Group ID 

DIBs[A] 

P(0) a 

P (TP ) a 

4 

6497.5, 6862.5, 7569.7 

0.28 

0.67 

6 

5900.6, 5923.4, 6317.6, 6325.1 

0.31 

0.66 

8 

5927.5, 7045.6 

0.29 

0.60 

11 

5535.7, 5541, 5545 

0.29 

0.60 

12 

6330.4, 7721 

0.25 

0.59 

22 

5524.9, 6845.3 

0.16 

0.54 

1 

6834.5, 7401.7 

0.12 

0.49 

5 

6362.4, 6827.3 

0.12 

0.49 

7 

7276.7, 7558.2, 8026.2 

0.12 

0.49 

26 

6236.6, 6741. 

0.0 

0.31 

3 

7357.2, 7585.6 

0.036 

0.28 

20 

7915.1, 8763.5 

0.0 

0.28 

9 

7153.8, 7334.3, 7935.3 

0.0 

0.27 

19 

6993.2, 6998.7, 7249.3 

0.0 

0.27 

15 

6270.1, 8038.5 

0.0 

0.26 

27 

6536.4, 7695.9 

0.0 

0.26 

2 

6978.5, 7330.2 

0.0 

0.24 

25 

4880.4, 6460.3 

0.016 

0.21 

14 

5418.4, 6944.5 

0.0 

0.20 

17 

7321.1, 7832.7 

0.0 

0.20 

18 

5719.4, 7705.9 

0.0 

0.20 

16 

6837.7, 7562.2 

0.0 

0.15 

10 

6860, 7366.6 

0.0 

0.074 

24 

6689.3, 6694.5 

0.0 

0.074 

13 

5947.3, 7099.5, 7105.9 

0.0 

0.071 

21 

5362.1, 7349.8 

0.0 

0.019 

23 

4501.8, 6632.9 

0.0 

0.019 


“Probability for perfect recovery. 

6 Probability for TRUE POSITIVE = P(0) + P(l) 

extinction, that we should be able in principle to cluster 
76 molecular and atomic absorption lines, and 57 DIBs with 
reasonable probabilities. The fact that we only see 22 atomic 
and molecular lines and 26 DIBs clustering is an indication 
that their correlation is too weak for us to measure. 

A solution that would increase the number (and size) 
of clusters, would be to lower the threshold for grouping, 
currently set at 3a. Using a threshold of 2a we obtain 62 
DIBs, and 54 atomic and molecular absorption lines that are 
clustered into groups, which are the majority of the detected 
lines. However, the price one pays is a higher false positive 
rate and lower confidence for the clusters obtained. Future, 
higher SNR, data may allow for more robust clustering of 
more lines. For reference, we include a table of the 2a DIBs 
clusters (Table [ 4 J, which may be indicative, if somewhat 
spurious. 


5 CONCLUSIONS AND DISCUSSION 

In this paper we systematically study the DIBs at high 
Galactic latitude, with a sampling and coverage never re¬ 
motely approached before. Our findings can be separated 
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Table 5. DIBs spectroscopic families and their associated 
molecules* 


DIBs [A] 

P(TP) a 

Molecule 

P(TP) b 

6317.6, 6325.1 

0.74 

C2 

0.74 

6497.5, 6862.5, 7569.7 

0.7 

C2 

0.65 

7276.7, 8026.2 

0.7 

C2 

0.69 

5927.5, 7045.6 

0.66 

CN 

0.57 


*The table contains only groups with P(TP) > 0.5 and that are 
associated with a molecule with P{TP) > 0.5. 

“Probability for the DIB group to be true positive. 
b Probability for the group to be associated with the molecule. 

into two. Based on the correlation of the DIBs with dust 
maps we find the following. 

• We measure the correlation of 142 DIBs with dust ex¬ 
tinction and find that only 19 of them show a significant cor¬ 
relation that is higher than 0.7. Some were measured in pre- 
vious studies (5705 A, 5780 A, 5797-lA, 6284.3 A, 6613.7 A 
and 8621.1 A). Most of the DIBs do not correlate with dust 
in the E(£> — V) range 0 - 0.18 Mag. 

• We find a pair of DIBs, 5925.9 A and 5927.5 A, with 
significant negative correlation of their depth with E (B — 
V). The carrier of these lines is probably depleted in dusty 
clouds. 

• We find that the DIBs 4428 A, 5512 A, 5850 A, and 
6445 A are absent in our data and the DIBs 4762 A, 4964 A, 
5487 A, 6090 A and 6379 A correlate weakly with dust. This 
group of absorption lines traverses previously established 
grouping based on dust and UV radiation. Moreover, pre¬ 
vious studies reported high correlations with dust for these 
lines. Since we do detect and measure strong correlations for 
DIBs with similar strengths, we suggest that their non de¬ 
tection (or low correlation) is connected to the environment 
we probe at high Galactic latitude. 

We subsequently apply a Hierarchical Clustering algo¬ 
rithm to the matrices of pairwise correlation between lines. 

• Testing our algorithm on 114 known atomic and molec¬ 
ular lines, we find that it is able to cluster most of the de¬ 
tected atomic and molecular absorption lines in our data. 
We obtain 11 groups of atomic and molecular lines (C 2 , CN 
and NalD), 9 are robust. Furthermore, we find that the only 
group that contains absorption lines that originate from dif¬ 
ferent molecules is also the group with the lowest probability 
to be correct. 

• Our simulations show that the algorithm we develop is 
mostly sensitive to DIBs with low correlation with E (B — V). 
This allows us to study weak DIBs which are typically barely 
studied and divide them into spectroscopic families. 

• We find 6 robust groups of DIBs (table |3|. We suggest 
that each of these groups is a spectroscopic family formed 
by a different carrier. 


• By clustering DIBs with known atomic and molecular 
lines we find that 4 of the 6 DIB groups are associated with 
molecular absorption lines. The groups (1) 6317.6 A and 
6325.1 A (2) 6497.5 A, 6862.5A and 7569.7A (3) 7276.7A 
and 8026.2 A are associated with C 2 absorption lines, and 
the group (4) 5927.5 A and 7045.6 A is associated with CN 
absorption lines. 


As the study of DIBs evolves to systematic observa¬ 
tions of thousands of lines of sight, the methodology needs 
to progress as well. With the added complexity also comes 
the promise of greater discoveries. In this study we clus¬ 
tered DIBs based on their partial correlation, after removing 
their individual correlation with dust extinction. With larger 
datasets (e.g., from the Dark Energy Survey Instrument 


Levi et al. 20131, and/or better SNR per spectrum (Mafz 
Apellaniz et al.|2014|), one could compare DIBs strength to 


additional species. One could measure the correlations of 
DIBs with HI and H 2 (via CO), PAHs, and also Galactic 
latitude, for example. These correlations can then again be 
fed to a clustering algorithm, and perhaps bring us a critical 
step closer to robust DIBs families. 
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APPENDIX A: FLUX - E (B V) CORRELATION 
SIMULATIONS 


Since our spectra undergo interpolation, normalization and 
stacking by E(B — V) it is likely that the initial correlation 
of the DIBs CD with E(B — V) will be affected. We wish to 
study the effects of the entire process on the correlation of 
various absorption lines and to find our detection limits. Our 
updated DIB catalog consists of 197 DIBs that spread over 
the SDSS wavelength range. We exclude DIBs with FWHM 
values that are less than 0.5 A using the average value that 
is reported in our updates DIB catalog and additional recent 
studies (e.g. Kos & Zwitter| 2013). Apart from the blending 
that is caused by our low spectral resolution and our stacking 
of multiple sight-lines at various velocities, studies list DIBs 
that are partly to heavily blended. The latter arouse debates 
of whether these DIBs are separate and blended absorption 
lines that may be caused by different carriers or a complex 
structure that is generated by a single carrier. Blended lines 
must be examined and treated separately since their cor¬ 
relation is affected by both our stacking process and their 
neighbors correlation. We therefore divide our simulation to 
single absorption lines and blended absorption lines. 


Al single absorption lines 

Figure [All shows a schematic representation of our simula¬ 
tions, as described in details here. We construct synthetic 
lines that are represented by Gaussian profiles. To each line 
we assign a correlation with reddening, pi , that is taken from 
a uniform distribution within the [—1, 1] range. The corre¬ 
lation is constructed by varying the Gaussians’ amplitude 
following the desired correlation, resulting in normally dis¬ 
tributed amplitudes which can be both negative (absorption 
line) and positive (emission line) and is similar to the typical 
CD of DIBs (roughly 0.1% - 3% of the continuum level). We 
construct the amplitude as follows: 

A= [pi ■ E(R — V) + 1 pi • A/"(0, ue(b-v))] ■ a + b 

(Al) 

Where A f is Normal distribution with zero mean and 
standard deviation of which is the standard devi¬ 

ation of our color excess distribution. The parameters a and 
b represent the slope and the intersect with the axis of the 
flux-E(R — V ) relation and are sampled from a log-uniform 
distribution in the range [-7, 0]. We scale the Normal distri¬ 
bution to the color excess distribution in order to balance 


the correlation contribution and the noise contribution in 
equation |A1| 

The standard deviation of the Gaussian is chosen from 
a uniform distribution in the range [1.06 A, 2 A] and is con¬ 
stant for different reddening values, the limits of the range 
are the 20th and 70th percentile respectively of the standard 
deviation that we measure for DIBs in paperl, we then con¬ 
volve it with a 2.5 A Gaussian to account for our resolution. 
This range does not represent broadest known DIBs (4428 A 
for example) though we show in paperl that these broad 
DIBs are not detected in our spectra. We set the central 
wavelength of each Gaussian from a uniform distribution in 
the range [3817 A, 9206 A] which covers all the SDSS wave¬ 
length grid in order to test for wavelength dependent biases 
and uncertainties. 

We create synthetic spectra that contain the 50 Gaus¬ 
sians for every reddening value and multiply it by the 1.5M 
original spectra. The resulting spectra undergo the inter¬ 
polation, normalization and stacking process discussed in 
section [2j after which we measure the correlation between 
the CD of the synthetic lines and reddening and compare it 
to the correlation we introduced. In order to construct a full 
image of the entire wavelength range we repeat the process 
of constructing Gaussians and injecting them to the original 
spectra 4000 times, in each repetition we assign new correla¬ 
tions, amplitudes, widths and central wavelengths, resulting 
in 200,000 synthetic lines that sample the entire parameter 
space well. The 200,000 synthetic lines are injected to the 
1.5M raw spectra, which results in 3 ■ 10 11 Gaussian injec¬ 
tions in total. 


By comparing the measured correlation of the CD and 
reddening, which is measured after the injection and stack¬ 
ing process, to the initial correlation we construct, we find 
that the initial correlation can be recovered from the mea¬ 
sured correlation regardless of the width of the Gaussian 
and its central wavelength as demonstrated in figure |A2| 
Furthermore, we find that while the estimation uncertainty 
is independent of the intersect with the axis, b, for the range 
we use in equation E3 it is sensitive to the initial slope, a. 
The estimation uncertainty is up to 10% and roughly con¬ 


stant for slopes a > 10 


-. The top panel of figure 


A3 


presents the dependance of the correlation estimation on 
the slope. Therefore, the amplitude of the synthetic lines 
is dependent only on the slope (via equation Al l and the 
mean EW of each synthetic line can be reconstructed from 
the mean amplitude and width of the Gaussian. The bot¬ 
tom panel of figure |A3| presents the correlation estimation 
as a function of the mean EW, where we use the mean am¬ 
plitude of every line for the EW computation. Therefrom, 
the mean EW detection threshold is approximately 7 mA, 
which is similar to what we find in paperl. We chose to set 
a = 10~ 2 ' 9 —— as the detection threshold, above which the 
estimation uncertainty is no longer dependent on the slope. 


We divide the synthetic lines to 30 bins of measured cor¬ 
relation (p m ) and measure the median initial correlation (pi) 
per bin, each bin contains 20,000-40,000 synthetic lines. For 
a measured correlation bin of zero the initial correlations are 
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Figure Al. Schematic representation of our basic simulation procedure. 


distributed normally around zero and as the measured cor¬ 
relation increases (decreases) the distribution of the initial 
correlation in each bin becomes left-skewed (right-skewed) 
since the correlation is blocked in the [-1,1] range. The uncer¬ 
tainty of each bin is defined as the MAD of the initial corre¬ 
lations in it. We present in figure pAf| the initial correlation as 
a function of the measured correlation and a 5th polynomial 
fit which minimizes the reduced chi-square. This represents 
the effect of our entire process on the initial correlation of the 
flux and can be used to deduce the initial correlation of DIBs 
given their measured correlation. The dependance of the ini¬ 
tial correlation on the measured correlation is not symmetric 
for positive and negative correlations. While negative initial 
correlations follow the measured correlation in an almost 
linear manner, positive initial correlations have a parabolic 
dependance of the measured correlation. This effect can be 
caused by residual negative correlation of the continuum- 
level with reddening - absorption lines weaken with redden¬ 
ing (positive correlation) and the continuum-level decreases 
with reddening simultaneously which results in overestima¬ 


tion of positive correlations. On the other hand, negative 
correlations are overestimated when weak, where the corre¬ 
lation of the continuum-level with reddening is comparable 
to it, and recovers to the linear relation since the correla¬ 
tion of the lines becomes more significant as the correlation 
decreases to -1. 


A2 blended absorption lines 

We repeat the process done with the single lines but this 
time adding to each simulated line a companion line 0.5-3 A 
away, with a uniform distance distribution. This distance 
distribution was constructed by trial and error, aiming to 
describe the blended lines entirely as we justify later in the 
section. Without prior knowledge we assume that both of the 
lines in a given pair can affect each others correlation with 
reddening and therefore we aim to execute a 2-dimensional 
binning (unlike the single synthetic lines for which we per¬ 
form 1-dimensional binning) based on both of the neigh- 
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Figure A2. Deviation of measured correlation from the intro¬ 
duced correlation as a function of the Gaussians central wave¬ 
length (top) and width (bottom). 



log(EW [A]) 


bors correlations. In order to include a sufficient number 
of synthetic lines in each bin and result with a normally 
distributed measurement we repeat the process of injecting 
lines 40,000 times, a total of 2 million synthetic lines. 

We compare between the initial correlations of the pair 
to the measured correlations as a function of the wavelength 
distance between the lines and find a deviation from the sin¬ 
gle line beha vior for distances lower than 1.75 A. As one can 
see in figure A5 pairs that are further apart than 1.75 A 


ft 


Figure A3. Initial correlation as a function of measured correla¬ 
tion and slope (top) and EW (bottom) of the synthetic line. The 
thresholds of the correlation detections are 7 mA for the EW and 

10 -2 - 9 ^— for the slope (black dashed). 
maa r v ' 
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Pm 


Figure A4. Initial correlation as a function of measured corre¬ 
lation for isolated lines (black) and 5th polynomial fit that mini¬ 
mizes the reduced chi-square (red). 



1.0 1.5 2.0 2.5 3.0 


Distance between neighbors (A) 

Figure A5. Deviation in measured correlation (pm) from initial 
correlation (pi) as a function of distance between central wave¬ 
lengths of the Gaussian pair (black) and the typical correlation 
estimation uncertainty of single lines (red). 

exhibit a behavior that is similar to the single lines profile 
(figure |A4| ) with up to 5% error, which is the typical uncer¬ 
tainty we measure for single lines. Pairs that are beneath 
the distance threshold show a mean deviation of up to 15%. 

In order to examine only blended lines we exclude all 
pairs for which the distance is higher than 1.75 A, remaining 
with 1 million synthetic lines. A comparison between the ini¬ 
tial correlations with reddening of the pair to the measured 
correlation brings us to divide the pair population into two 
categories: pairs with the same correlation signs and pairs 


with opposite correlation signs. Pairs of initial correlations 
with the opposite sign tend to destroy each others’ corre¬ 
lation almost entirely - for an absolute correlation range of 
[0, 0.9] the measured correlation degenerates to an almost 
constant value in the range [0, 0.05] while the absolute corre¬ 
lation range of [0.9, 1] results in measured correlation of [0.1, 
0.4]. This degeneracy prevents us from deducing a function 
that converts measured correlation to initial correlation and 
we discard all the DIB pairs that fall into this category (5 
pairs). We find that pairs of DIBs with the same sign of ini¬ 
tial correlation with reddening have a mutual effect on each 
other. The measured correlation with reddening of every line 
in the pair can be as high as the initial higher correlation but 
it can go down to zero regardless of the initial lower corre¬ 
lation (e.g, a pair with initial correlations of 0.9 and 0.7 can 
exhibit a measured correlation of 0.6 and 0.1). Furthermore, 
we find that the line with higher initial correlation presents 
the higher measured correlation on average. We therefore 
define the line with the higher correlation as Linel and the 
line with the lower correlation as Line2 and examine how 
the initial correlation of Linel and Line2 depends on the 
measured correlation of both of them. 

We present in figures |A6| and |A7| the 2D binning of the 
synthetic lines. Figure |A6| represents the initial correlation of 
Linel (left) and Line2 (right) as a function of the measured 
correlations of the pair. Figure |A7| represents the relative 
scatter per bin. The bins were constructed by the measured 
correlation values (50 bins for the range -1 to 1), thus the 
number of lines varies from bin to bin. As noted earlier, the 
bins that represent extreme correlations with opposite signs 
do not contain lines at all (white) regardless of the initial 
correlation. The 9 remaining DIB pairs are represented as 
blue circles on the figure. We find that beneath an absolute 
correlation of 0.5 the initial correlations cannot be recov¬ 
ered due to the degeneracy of the colormap and the high 
scatter per bin, we therefore further exclude 4 pairs that 
present such correlations, these are marked with red crosses 
in the figures. Furthermore, 2 of these pairs contain DIBs 
that are below our threshold for detection. The remaining 
5 DIB pairs exhibit measured correlations that are higher 
than 0.5 and we therefore can recover their initial correla¬ 
tions by crossmatching their measured correlations on the 
plot. We chose the bins that match the 7 DIBs pairs that 
remain, 5 pairs with correlation that is higher than 0.5 and 
2 pairs which we can not measure initial correlation for, and 
plot histograms of the initial correlation for them in fig¬ 
ure |A8| Every row in the plot represents a pair of blended 
DIBs and is divided to Linel and Line2. One can see in 
the second row the DIBs 5780.6A and 5779.5A, the pair 
is represented by the lowest blue point on the left in figures 
|A6| and |A7| (since their measured correlations are -0.98 and 
-0.96 respectively), therefore the scatter in the initial corre¬ 
lations is very low and their correlation uncertainty is also 
low (see table[l]). However, the last row represents the DIBs 
6845.3 A and 6846.6 A, which have measured correlations 
that are lower than 0.5 and are represented by a blue point 
with a red cross. One can see that the initial correlation 
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Initial correlation 


Figure A6. Linel (left) and Line2 (right) initial correlation as a function of measured correlations. The blue points are all the pairs 
of blended DIBs, from this we deduce the initial correlation of Linel and Line2. Red crosses represent pairs of DIBs which we can not 
measure initial correlation for as the scatter is too big. 


in the bin that fits this pair is very scattered, their initial 
correlation uncertainty is therefore high (see table [I]). 


APPENDIX B: PAIRWISE CORRELATION 
SIMULATIONS 

The algorithm described in section |4.1| consists of several 
phases, each phase can be executed in more than one way. 
We therefore use simulated absorption lines to test different 
approaches towards each phase. Furthermore, we use the 
simulations to quantify the quality of every resulting group 
and find parameters that affect the goodness of the cluster¬ 
ing using a scoring system for the resulting groups. We infer 
subsequently thresholds to remove false-positive groups. 

We simulate a number of ‘carriers’: each carrier consists 
of 2-6 absorption lines (Gaussian functions), it has a corre¬ 
lation with color excess as well as a correlation with other 
carriers. The correlations, which are taken from a uniform 
distribution in the range [-1, 1], are expressed through the 
amplitude of the absorption lines similarly to the simula¬ 
tions in section [AT| We construct the base amplitude of the 
carrier as follows: 


A =p d • E(B — V) + p h ■ H + A(p d ,Ph) (Bla) 

H =A/’(0, oe(b-v)) 

A = \/l + Pd ' Afd( 0, oe(b-v)) 

+ \J^ + p\ ' A/j>j(0, &E(B—V)) 

Where pd is the correlation with color excess and 
A/”(0, o'e(b-v')) is a Normal distribution scaled to the color 
excess distribution. The correlation with other carriers is 
expressed through a correlation, ph, with a hidden variable, 
H, which is also drawn from a Normal distribution that is 
scaled to the color excess. A [pd, ph) represents the noise. We 
assume that lines that belong to the same carrier correlate 
perfectly. We therefore scale their amplitude, A;, according 
to the base amplitude of their carrier: 

A — A • Oi -\- bi (B2) 

Where the index i can be 0 - 6 and represents the index 
of the absorption line, a; and bi are the slope and the inter¬ 
sect with the axis and they are sampled from a log-uniform 
distribution in the range of [-4, 0] for every absorption line. 
The standard deviation of every Gaussian is chosen from a 
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Figure A7. Top panel: Linel (left) and Line2 (right) relative uncertainty in every measured correlation bin. The blue points are all the 
pairs of blended DIBs, from this we deduce the initial correlation uncertainty of Linel and Line2. Red crosses represent pairs of DIBs 
which we can not measure initial correlation for as the scatter is too big. 


uniform distribution in the range [1.06 A, 2 A] and the cen¬ 
tral wavelength is chosen from a uniform distribution in the 
range [3817 A, 9206 A] as described in section 

We divide the simulation to two separate simulations, 
the first simulation includes the carriers with 2 and 3 ab¬ 
sorption lines and the second simulation includes the carri¬ 
ers with 4-6 absorption lines. In the first simulation we 
inject 10 carriers with 2 absorption lines and 10 carriers 
with 3 absorption lines, which result in 50 injected lines per 
spectrum, while in the second simulation we inject 5 carriers 
with 4,5 and 6 absorption lines, 75 lines per spectrum. We 
inject the lines to the original 1.5M spectra and repeat the 
process of normalization, interpolation and stacking as dis¬ 
cussed in section |2] and obtain 250 coordinate-based binned 
spectra. We then execute the algorithm we describe in sec- 
tion |4.1[ we use the central wavelengths of the synthetic lines 
to obtain the reduced normalized correlation matrix and we 
perform Hierarchical Clustering on the matrix. We extract 
the clusters which form under a threshold of 3<r. 

In order to quantitatively compare between the result¬ 
ing groups and the initial groups we define a score for every 
group: (1) score = ‘O’: the final group contains the same lines 
as the initial group i.e., the group was recovered perfectly, 
(2) score = ‘1’: the final group contains some of the lines 
from the initial group and does not contain additional lines 


Al 


i.e., the group was partially recovered and (3) score = ‘2’: 
the final group contains lines that are not a part of the ini¬ 
tial group i.e., the group was not recovered. The third score 
represents the false positives sample. We present an example 
of the clustering result for synthetic lines in figure |B1| The 
top panel shows the clustering of the input lines, the mid¬ 
dle panel shows the clustering of the measured correlations 
(p) and the bottom shows the clustering of the normalized 
correlations (prior™), the labels on the x-axis represent the 
initial group index of every line. The newly formed group 
(17, 17, 10, 17) in the middle panel is an example of lines 
that were clustered incorrectly because of wavelength prox¬ 
imity since the pair of lines from the 17th and 10th group 
are located at 5297 A and 5299 A respectively. Furthermore, 
we find that 74% of the false positives in the measured cor¬ 
relation clustering are caused by a proximity of otherwise 
non-correlating lines, this validates the use of the normal¬ 
ized correlation matrix as the basis for the clustering. One 
can find the three different group scores we define when com¬ 
paring the top panel to the bottom panel: the group (13, 13, 
13) exists in both of the clusterings and therefore it was re¬ 
covered perfectly and its score is ‘O’, the input group (19, 19, 
19) was recovered as (19, 19) and therefore it was partially 
recovered and its score is ‘1’. The group (1, 1, 10) in the 
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Figure Bl. In the top panel we show an example of input carriers we simulate, each having 2—3 lines. In the middle panel we show 
the clustering result using the measured correlation matrix. The bottom panel shows that using the normalized matrix improves the 
clustering significantly and we recover most groups, with little confusion. 


normalized clustering is a false positive since its lines do not 
belong to the same input group, its score is therefore ‘2’. 

We repeat both of the separate simulations 1000 times 
and result with 125,000 absorption lines, times 1.5M spec¬ 
tra, in total. We use the lines that were clustered to groups 
and the groups scores in order to divide the lines and their 
parameters to 3 groups according to their groups score. Fig¬ 
ure [B2] presents the probability of an absorption line to be 
clustered in a ‘O’, T’ and ‘2’ groups (red, blue and black 
respectively) as a function of the Gaussian width, central 
wavelength, b parameter (intersection with the axis) and the 
size of their group. We find that the probability of a line to be 
clustered in a perfect, partial or false group does not depend 
on the Gaussians width or central wavelength. Similarly to 
section [AT] we find that the probability does not depend on 
the b parameter either. The resulting group size does af¬ 
fect the probability, though we show in section |4~3| that the 
DIBs always cluster into small groups. The strongest effect 
we find is of the slope of the relation, and on the correlation 
of the line with color excess. As the slope of the relation 
steepens the change in the flux per bin is higher than the 


average scatter. Since we compute the partial flux-flux cor¬ 
relation we essentially remove the signal that is caused by 
the dust column density and remain with the signal that is 
perpendicular to the E(£> — V) vector. As the correlation 
with reddening increases the partial correlation removes a 
bigger part of the signal, and one remains with a lower sig¬ 
nal comparable to the noise in the data. 

We present in figure [B3| the probability of a line to be 
clustered in a ‘O’, ‘1’ and ‘2’ group as a function of its slope 
and its correlation with E (B — V). The top panel in the 
figure presents the probability of a line to be clustered in 
a perfectly recovered group, the middle panel in the figure 
presents the probability of a line to be clustered in a partially 
recovered group and the bottom panel is the probability to 
be clustered in a false positive group. When the slope is 
smaller the probability of the line to be clustered in a false 
positive group increases, up until a = lO -2 ' 5 ^^ where the 
probability is almost 1 regardless of the correlation with 
E(B — V). It is also apparent that in most of the bins the 
probability of a line to be clustered in a partially recovered 
group is higher than the probability to be clustered in a 
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Figure A8. Histograms of initial correlation of DIB1 (left panel) 
and DIB2 (right panel) for a 2D measured correlation bin. The 7 
rows represent the 5 pairs of DIBs for which one can measure an 
initial correlation (represented by blue circles in figures |A6| and 
ED and 2 pairs of DIBs which exhibit measured correlations that 
are lower than 0.5 (represented by blue circles with red crosses 
in figures |A6| and . One can see that the scatter for the lat¬ 
ter is high as is indicated in figure |A7| their initial correlation 
uncertainty is therefore high (see table|l|l. 
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Figure B2. The probability of a synthetic absorption line to be 
clustered in a perfectly recovered group (red), partially recovered 
group (blue) and incorrectly recovered group (black) as a function 
of the lines initial parameters: its Gaussians width and central 
wavelength, it intersection value with the axes and the size of its 
group. 


perfectly recovered group. We use the 2D binning to obtain 
the probabilities of the measured absorption lines (DIBs, 
atoms and molecules) to be clustered in ‘O’, ‘1’ and ‘2’ groups 
in sections 14.21 and 14.31 

Since we wish to concentrate on dividing out (by prob¬ 
ability analysis) the DIB, atom and molecule groups that 
were falsely clustered, we define the probability to be clus¬ 
tered in a false group to be our false positives (FP) and 
the probabilities to be clustered in ‘0’ and ‘1’ groups to be 
the true positives (TP=1-FP). We measure the FP and TP 
rates as a function of the clustering method (complete, sin¬ 
gle, weighted and average) and find no dependance of the 
rates on the different methods. We also measure the rates for 
different clustering thresholds and find that a threshold of 3cr 
maximizes the TP/FP rate. We therefore use the complete 
clustering method with a 3 a threshold for the grouping. 

In order to assign the probability of a resulting group 
to be TP or FP we compute the correlation of flux with 
E (B — V) and the slope of the relation for the absorption 
lines. We present the slope histogram for the atomic and 
molecular lines in figure |B4| and the slope histogram for the 
DIBs in figure |B5| the threshold that we present in the fig¬ 
ures (red) is the threshold we deduce via the simulations in 
section I aTI which divides out the lines we can measure a cor¬ 
relation with E(B— V) for. Clearly, most of the relations with 
E (B — V) can be measured. However, the group recovery de¬ 
pends also on the correlation with E (B — V) and we can not 
cluster lines that correlate strongly with dust. We then de- 
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Figure B4. Histogram of the slopes of atoms and molecules from 
our catalog (blue) and the detection threshold for the slope (red). 


duce the probability of every line in the clustered lines by 
crossmatching its slope and correlation with E (B — V) using 
figure |B3[ We find that, generally, lines that form a group 
share similar slopes and correlations with E (B — V) which 
is expected for lines that correlate well with each other. We 
therefore compute the median slope and correlation with 
E (B — V) value for every group and deduce the probability 
of the group to be TP or FP using figure [B3] 
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Figure B3. The probability of a synthetic absorption line to 
be clustered in a perfectly recovered group (top panel), partially 
recovered group (middle panel) and incorrectly recovered group 
(bottom panel) as a function the slope of its flux relation and its 
correlation with reddening. 
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Figure B5. The log-uniform distribution of DIBs slopes (blue) 
and the slope threshold for correlation recovery (red). DIBs with 
slopes smaller than -2.9 are discarded from the analysis. 
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